Abstract

The purpose of the present study was to apply the percentage of data points exceeding the median of baseline 
phase (PEM) approach for a meta-analysis of single-case experiments to compare the relative effectiveness of 
different kinds of reinforcers used in behavior modification. Altogether 153 studies were located, which produced 
1091 effect sizes. The grand mean of the 153 PEM scores was .92. The main finding was that among the positive 
reinforcers, “activities” was the most effective (PEM score = .95) while “objects” and “edibles” were relatively less 
effective (PEM scores = .85 and .83, respectively). These results can inform the choice of reinforcer for
behavior modification interventions.

Introduction

In educational settings, it is desirable that all students recognize the value of knowledge and skills and are
intrinsically motivated to learn, but this is rarely the case. Teachers therefore often have
to employ extrinsic reinforcers to motivate students to learn. After good school performance has been
reinforced frequently, the act of learning can be expected to become a conditioned response.
Once the teacher shifts to intermittent reinforcement, or the student experiences the natural consequences of
success in education or career, it is hoped that learning will become intrinsically motivated and require less
extrinsic reinforcement.

Reinforcement can be classified into four kinds: (a) positive reinforcement (giving positive 
reinforcer), (b) punishment (giving negative reinforcer), (c) punishment (withdrawing positive reinforcer), 
and (d) negative reinforcement (withdrawing negative reinforcer). The present study concentrated mostly 
on the comparison of the effectiveness of positive reinforcers including edible foods, tangible objects, 
activities, and tokens. Less attention was paid to the effectiveness of negative (aversive) reinforcers, but the
mean effect sizes of punishment are presented for the purpose of comparison.

Many studies have reported success in the use of primary reinforcers to modify the
behavior of participants (e.g., Forness, Kavale, Blum, and Lloyd, 1997; Kern, Ringdahl, Hilt, and
Sterling-Turner, 2001; Osborne, 1969; Williams, Koegel, and Egel, 1981).

Cameron and Pierce (1994) conducted a meta-analysis to address whether extrinsic
reinforcement is harmful to intrinsic motivation and found that rewards given for task completion or
for quality of performance are not detrimental to intrinsic motivation.

According to the principle of behavior modification, in order to expect a desirable behavior to happen 
in the future, three conditions must be fulfilled as follows: a discriminative stimulus must be present; 
there must be a contingency for reinforcement of the target behavior, and the reinforcer must be able to 
satisfy the need of the individual. Researchers in behavior analysis have paid more attention to the third 
condition recently. Neef and Lutz (2001) found that the effect of more preferred reinforcers was higher 
than that of less preferred reinforcers. The results of Pace, Ivancic, Edwards, Iwata, and Page’s (1985) 
study confirmed that the success of reinforcement depends on the selection of suitable reinforcement 
schedules and contingencies. Glynn (1970) found that the effect of self-determined and experimenter-determined
token intervention on the learning of history and geography material was superior to that of
chance-determined and no-token interventions.

It may therefore be hypothesized that the effect size of an intervention would be larger when the
reinforcer can better meet the needs of the participant and serve as a mechanism to increase his or her
motivation.

The tool used to measure the effectiveness of different reinforcers was the percentage of data points
exceeding the median of baseline phase (PEM) approach (Ma, 2006). By far the most widely used
method for measuring effect size in single-case experimental designs is the percentage of
nonoverlapping data (PND) approach proposed by Mastropieri and Scruggs (1985-1986). The strengths
and weaknesses of the PEM approach and its superiority over the PND method have been discussed by Ma
(2006) and empirically confirmed by Gao and Ma (2006), Chen and Ma (2007), Ma (2009), and Preston
and Carter (2009). Therefore, it was decided to apply the PEM approach to compare the relative
effectiveness of different reinforcers used in the field of behavior modification.

Method 

Procedures for Locating Studies 

The single-case experimental studies investigating the effect of reinforcers analyzed in this synthesis
were obtained through a computer-assisted search of the relevant databases, including EBSCOhost, ERIC,
and ProQuest. There was a large overlap between the studies located in EBSCOhost and those located in
ProQuest. Descriptors included token economy, token system, reinforcement, or reinforcer. A hand search
of relevant journals of behavior analysis such as Journal of Applied Behavior Analysis; Behavior
Disorders; Behavior Modification; Behavior Assessment; Behavior Therapy; Behavior, Research and
Therapy; and Journal of Special Education was conducted. Additionally, manual searches of the
reference lists of the articles identified were carried out to locate further studies. Studies were included in
this synthesis if the data of the baseline and treatment phases of a reversal or a multiple-baseline design were
graphically displayed for individual participants in a time series format, enabling the computation of PEM scores.
Studies that employed an AB design were excluded because such a
design lacks internal validity and alternative interpretations of a result cannot be ruled out. Altogether
153 studies were included in the meta-analysis. The publication years of the located studies ranged from 1968 to
2006.

Procedure for Coding a Sample Study 
Variables in each of the following areas were coded: 

1. Original author(s)’ conclusion on the overall effectiveness of an intervention: 2 = effective, including
“highly effective”, “successful intervention”, “all data points of inappropriate behavior during the
treatment phase were below the mean that occurred during the baseline phase”, “noticeable reduction of
inappropriate behavior”; 1 = moderately effective, including “slightly effective”, “gradually improved”,
“above baseline level but was unstable”, “was not immediately effective but effective later”; and 0 =
questionable or not effective. Because it is hard to distinguish between questionable and no effect, the two
were combined in the present study.

2. Categorization of independent variables: The reinforcers were coded regardless of whether they were
delivered by the interventionist or by the participant him- or herself in the case of an intervention with
self-management. The coding numbers and operational definitions of the independent variables are listed
as follows (the coding numbers are not consecutive because they are numbered in accordance with
their categories):

11. Edibles: Providing food or varied edible reinforcers. 

12. Objects: Providing objects, such as a toy; obsession (those items that the participants continually 
sought out or verbally requested). 

13. Activities: Providing activities, such as interactive play; allowing a choice of activities;
incorporating echolalia into a task response; presenting varied tasks instead of constant tasks;
choosing books or stories to be read by the experimenter; sitting on a therapy ball instead of in a
traditional classroom chair; playing electronic video games; choice of preferred game (or toy); free time
after remaining in seat; being given preferred reading material; rhythmic entertainment; music; puzzles.

14. Giving secondary reinforcer: Praise; attention (making a statement or physical gesture to the
participant); nonverbal approval, such as a smile and physical contact.

21. Giving negative reinforcers: Giving an aversive stimulus, including a reprimand; a stern “No”; icing
on the facial area contingent on bruxism; over-correction; positive practice overcorrection; loud noise;
response blocking; shock; electric stimulation (Self-Injurious Behavior Inhibiting System);
suppression (sharply saying “No” and briefly holding the part of the child’s body when the child
performed self-stimulation); requiring the child to stand up and sit on the floor five to ten times
contingent on an inappropriate behavior.

22. Withdrawal of positive reinforcers: Time out (isolating the participant from the reinforcing 
situation); extinction (ignoring the inappropriate behavior); using earlier curfew contingent on late 
entering a residence; withdrawal of attention; escape extinction (participant could escape only after
completing a task); brief escape from dental treatment contingent upon co-operative behavior; break
from the task only after completion of a part of the overall task; and sensory extinction. 

23. Differential reinforcement of alternative appropriate behavior (DRA) with reinforcers other than 
tokens: Extinction of inappropriate target behavior and reinforcement of appropriate behavior. 

30. Package of positive reinforcement and punishments other than removal of token: Reinforcement
contingent on desirable behavior and punishments, with the exception of removal of token,
contingent on the undesirable behavior as a package.

40. Token: Tokens which can be redeemed for back-up preferred primary reinforcers at a later point 
in time, including points, lottery, and money. 

41. Package of token reward plus withdrawal of token: Token reinforcement contingent on
appropriate behavior and removal of token contingent on inappropriate behavior.

42. Package of token reward plus punishments other than removal of token: Token reinforcement
contingent on appropriate behavior and other kinds of punishment (other than removal of token)
contingent on inappropriate behavior.

43. Differential reinforcement of alternative appropriate behavior (DRA) with tokens: Extinction of 
inappropriate target behavior and reinforcement of alternative appropriate behavior with tokens as a 
package. 

3. Categorization of dependent variables: Target behaviors were classified into six categories: 

51. Quality of academic behavior as measured by accuracy, including: The learning of sign language, 
making appropriate verbal responses, showing skills in matching, and fluency in speaking. 

52. Quantity of academic behavior as measured by the number of tasks completed, including:
Instruction following, and percentage of taking medication.

53. Socially desirable behavior: Class attendance; attending behavior in the classroom; engagement in 
interactive play; following of a dressing routine in the family; being on-task; showing compliance; 
making eye contact; making appropriate requests; being in-seat; appropriately recruiting teacher 
attention; payment of fines; and consumption of tokens. 

61. Problem behavior: Anti-social behavior; returning too late to the dormitory; disruptive behavior;
tantrums; perseverative speech; making a noise; not attending; making inappropriate movements
of the body; talking rudely; packing food into the mouth without swallowing; thematic ritualistic



The Behavior Analyst Today 


Consolidated Volume 10, Numbers & 4 


activities; being off-task; expulsion and refusals during mealtime; social avoidance behavior; 
aggressive behavior; stealing; and stamping. 

62. Self-injury: Bruxism; and ingesting pills.

63. Self-stimulation: Rocking and hand-flapping; stereotyped behaviors; hand-clapping;
object-mouthing; thumb-sucking; and excessive alcohol consumption.

4. Settings: Intervention settings were classified as (a) home; (b) institution, including clinic and various 
therapeutic centers, laboratory, residential facility, hospital classroom, room in an institution, infirmary 
playroom, home-style rehabilitation setting, and achievement placement; (c) school, including facilities 
in different levels of school, such as classroom and cafeteria, day-care program, co-operative student 
dormitory; and (d) other places, including unspecified room or playroom, semi-naturalistic setting, 
facilities in the community, such as outdoor cafeteria, factory, recreation center, and empty meeting 
room. 

5. Interventionists: Interventionists were classified into: (a) experimenter, including treatment provider,
facilitator, research assistant, nonprofessional staff, educational staff, observer, and recorder; (b)
specialist, including author, researcher, therapist, instructor, counselor, clinician, and teaching parent;
(c) teacher, including swimming coach and trainer; (d) tutor, including peer teacher and home tutor; and
(e) parents, including caregivers.

6. Participants: Participants were classified as those with: (a) Attention deficit hyperactivity disorder
(ADHD); (b) Autism Spectrum Disorder, Asperger’s syndrome, Down Syndrome; (c) mental illness,
including psychiatric and psychotic patients; (d) emotional or behavior disorders; (e) learning
disabilities; (f) mental retardation of different degrees of severity including mongoloid, educable mental
retardation, developmentally disabled, global developmental delay, organic brain syndrome, left
hemiparesis, a left visual field defect, brain injury, multiple handicaps, speech and language development
delay, and developmental disabilities; (g) “typical” intelligence including participants with disruptive
behaviors or deficient in sustaining attention, pre-delinquent behaviors, asthma, psychological
problems, physically handicapped; and (h) deafness and hearing impairment.

7. Age of participant. Age was divided into five groups: Below 7, 7-12, 13-15, 16-18, and beyond 18 
years old. 

8. The length of the treatment phase was coded in order to examine whether a longer treatment phase has 
a higher effect. 

9. The first pair of baseline-treatment phases and the pairs after that were coded so that the effect of the
orthogonal slope change on the effect size of the second pair of baseline-treatment phases described by
Scruggs, Mastropieri, and Casto (1987) could be examined. They assumed that the data points of an
appropriate target behavior would show a gradual downward trend in the second baseline and a gradual
upward trend in the second treatment, thus forming an orthogonal slope change.

10. Methods of assessing the preference of reinforcers: This moderator refers to the approaches taken by
the original authors to choose a reinforcer that would satisfy the needs of a participant: 0 = no mention
of assessment (it was assumed that the reinforcer was decided by the author either based on a review of
the literature or on his or her professional judgment); 1 = the reinforcer was suggested by significant
others of the participant such as parent or teacher (the information was gathered through interview or
questionnaire); 2 = the reinforcer was chosen based on a functional analysis (after an informal interview
and observation, an experimentally manipulated multi-element design was conducted to investigate the
environmental events that led to subsequent behavior changes in order to identify the factor which
critically reinforced the problem behavior); 3 = preference test (a list of potential reinforcers was
compiled and arrayed to let the participant show his or her preference by consuming them or playing
with them), including pair-wise comparison of preference; the reinforcer list was decided by the
participant or formulated through discussion with the participant; the participant was able to redeem
tokens in a “store” of back-up reinforcers; provision of choice-making opportunities in arranging the
schedule of activities; observation of the behavior of the participant to comprehend what the participant
preferred as reinforcers; use of the Premack principle (use of a high frequency activity as a reinforcer to
reinforce less preferred activities); incorporating thematic ritualistic behaviors preferred by the child
with autism into games to facilitate social play; 4 = using money as reinforcer.

Computation of PEM scores. To compute a PEM score, one needs only to draw a horizontal
median line in the baseline phase. This line passes through the middle data point when the number of
data points in the baseline phase is odd and falls between the two middle points when the number of data
points is even. The median line is then extended horizontally into the treatment phase, and the
percentage of treatment-phase data points above the median line is taken as the effect size
score. Under the null hypothesis that the treatment is not effective, the data points in the treatment
phase fluctuate around the extended median line and each data point has a probability of
.5 of falling above the line. If instances of the undesired behavior are expected to decrease after the intervention is
introduced, then the PEM score is the percentage of data points below the median line in the
treatment phase. Figure 1 illustrates the method of calculation and compares the PEM and PND
scores using artificially generated data. It shows the effect of reinforcement consisting of access to a play
activity on the acquisition of three elements of social skills (appropriate requesting, commenting, and
sharing) in a kindergarten class. In the upper panel of Figure 1, the median of the baseline is 18%; two
data points were above and two other points below the median. Twelve data points in the treatment
phase were above the extended median line, hence the PEM score was 12/15 = .8, a moderate effect.
According to the criterion set by Scruggs, Mastropieri, Cook, and Escobar (1986), a score greater than or
equal to .9 is highly effective, a score greater than or equal to .7 but less than .9 is moderately effective,
and a score less than .7 is questionable or not effective. Because there was an outlier data point (50%) in
the baseline phase and the highest data point in the treatment phase was also 50%, no data point in the
treatment phase exceeded the highest point in the baseline phase; the PND score is therefore 0/15 = .0
(not effective). This sensitivity to a single baseline outlier is why the PEM approach is more justifiable than the PND approach.
The middle and lower panels of Figure 1 show PEM scores indicating a high effect and no effect of treatment,
respectively.
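
The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the data are hypothetical, chosen to reproduce the upper-panel example (baseline median of 18, a baseline outlier of 50, and 12 of 15 treatment points above the median), and the sketch assumes that points lying exactly on the median line are not counted as exceeding it.

```python
from statistics import median

def pem(baseline, treatment, increase_expected=True):
    """Proportion of treatment points beyond the baseline median."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one data point")
    m = median(baseline)
    if increase_expected:
        beyond = sum(1 for y in treatment if y > m)
    else:
        beyond = sum(1 for y in treatment if y < m)
    return beyond / len(treatment)

def pnd(baseline, treatment, increase_expected=True):
    """Proportion of treatment points beyond the most extreme baseline point."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one data point")
    if increase_expected:
        extreme = max(baseline)
        beyond = sum(1 for y in treatment if y > extreme)
    else:
        extreme = min(baseline)
        beyond = sum(1 for y in treatment if y < extreme)
    return beyond / len(treatment)

def effectiveness(score):
    """Label a score with the cut-offs of Scruggs et al. (1986) quoted in the text."""
    if score >= 0.9:
        return "highly effective"
    if score >= 0.7:
        return "moderately effective"
    return "questionable or not effective"

# Hypothetical data (percent correct per session), not taken from any study.
baseline = [10, 15, 18, 20, 50]          # median 18, outlier 50
treatment = [15, 17, 16, 20, 22, 25, 28, 30, 32, 35, 38, 40, 45, 48, 50]

print(pem(baseline, treatment), effectiveness(pem(baseline, treatment)))
print(pnd(baseline, treatment), effectiveness(pnd(baseline, treatment)))
```

With these made-up values the PEM score is 12/15 and the PND score is 0/15, which mirrors the contrast drawn in the text: one extreme baseline point is enough to drive PND to zero, while the median is unaffected by it.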



Fig. 1. The effect of reinforcement consisting of access to a play activity on the acquisition
of three elements of social skills in a kindergarten class (horizontal axis: Session).

Reliability. Two part-time research assistants, one a student in a doctoral program and one in a master’s
program in education, independently carried out the calculation and coding of 42% of the
PEM scores and of the judgments of the original author(s). The percentage of agreement was calculated by the
formula: agreements / (agreements + disagreements). The reliabilities of coding for the reinforcer data
were as follows: PEM scores = 344 / 449 = .77, judgment = 428 / 449 = .95. For the token data, PEM
scores = 346 / 365 = .95, and judgment = .83. To bring the reliability of the final results
of the present study close to 1.00, the two assistants were asked to carry out the calculation and coding of all of the
PEM scores and judgment scores, and the present author made the final check and resolved the
disagreements.
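
The agreement formula can be checked against the reported counts with a short sketch. The function name is hypothetical, and the disagreement counts are simply the reported totals minus the reported agreements.

```python
def percent_agreement(agreements, disagreements):
    """Agreement ratio: agreements / (agreements + disagreements)."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("no coded items")
    return agreements / total

# (label, agreements, total coded) as reported in the text
reported = [
    ("reinforcer data, PEM scores", 344, 449),
    ("reinforcer data, judgment", 428, 449),
    ("token data, PEM scores", 346, 365),
]

for label, agree, total in reported:
    ratio = percent_agreement(agree, total - agree)
    print(f"{label}: {ratio:.2f}")
```

Rounded to two decimals, these three ratios agree with the values given above (.77, .95, and .95).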

Results 

In the use of statistics to analyze the data of a meta-analysis of single-case experimental designs
and to explain the findings, the three basic assumptions of parametric statistics (normality, independence,
and variance homogeneity of the distribution of residuals) should not be violated; otherwise, a
nonparametric statistic should be used, such as the Kruskal-Wallis analysis of variance by
ranks to test the significance of the difference between multiple groups and the Mann-Whitney U
test for the difference between two groups. The Levene statistic, which is available in the SPSS package, can be used to
test the assumption of the variance homogeneity of the residuals.

When a mean effect size was used to represent the effect size of each located study, the lag 1
autocorrelation (the correlation between observations i and i + 1 of the same set of data) was -.09 with a standard error of
.08, p > .05, indicating that the data were independent and did not violate the assumption of the independent
distribution of the residuals, which were produced by subtracting the mean PEM score of each study from
the grand mean of the 153 studies. A one-sample t-test resulted in t(152) = 54.94, p < .01, indicating
that the grand mean effect size of .92 was significantly different from the hypothetical PEM score of .5.
When every effect size in each located study was used as a unit of analysis, there were 1091 effect
sizes from the 153 studies. The lag 1 autocorrelation of .16 with a standard error of .03, p < .05, was
significantly different from zero, indicating that the data violated the assumption of the independent
distribution of the residuals. Therefore, nonparametric statistics had to be employed to analyze the 1091
effect sizes.
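
As a rough illustration of the two checks described above, the following sketch computes a lag 1 autocorrelation and a one-sample t statistic in plain Python. The function names are hypothetical, and the large-sample standard error 1/sqrt(N) is an assumption: the text does not say how the standard errors were obtained, although this approximation rounds to the reported .08 for N = 153 and .03 for N = 1091.

```python
import math

def lag1_autocorrelation(xs):
    """Correlation between each observation and the next one in the series."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("series is constant")
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return num / denom

def approx_standard_error(n):
    """Large-sample approximation 1/sqrt(N) (an assumption, see lead-in)."""
    return 1 / math.sqrt(n)

def one_sample_t(xs, mu0):
    """t statistic for the mean of xs against the hypothesized value mu0."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

print(round(approx_standard_error(153), 2))   # study-level analysis
print(round(approx_standard_error(1091), 2))  # effect-size-level analysis

# Hypothetical mean PEM scores, tested against the null value of .5
print(one_sample_t([0.9, 1.0, 0.8], 0.5))
```

An autocorrelation whose absolute value exceeds about two such standard errors is conventionally treated as different from zero, which matches the pattern reported: -.09 against .08 is not significant, whereas .16 against .03 is.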

Validity 

The Spearman rank correlation between the judgments and the PEM scores was r(1082) = .39, p <
.01. Table 1 shows that the mean effect sizes of all three categories of effectiveness judged by the original authors fall
within the criteria suggested by Scruggs et al. (1986): ninety percent (970 effect sizes) of treatments
with reinforcement showed high effectiveness with a mean effect size of .93, seven percent (78 effect
sizes) were moderately effective with a mean effect size of .79, and only three percent (35 effect
sizes) had no or questionable effect with a mean effect size of .33.



Analyses of Independent Variables, Dependent Variables, and Moderators

The results of the analyses of independent variables, dependent variables, and moderators are 
displayed in Table 1. 

Table 1

Analyses of independent variables, dependent variables, and moderators

Subcategory (with coding number)                N   Mean     SD   Mean rank

Independent variables
  13. Activities                              148   0.95   0.12         624
  30. Positive reinforcer plus punishment      25   0.93   0.22         620
  21. Negative reinforcer                     140   0.94   0.16         612
  40. Token                                   190   0.92   0.19         586
  43. DRA with token                           50   0.91   0.23         565
  42. Token plus punishment                    18   0.96   0.05         525
  22. Withdrawal of positive reinforcer        92   0.88   0.21         509
  23. DRA                                      24   0.86   0.27         503
  14. Praise                                   73   0.89   0.17         490
  41. Token plus removal of token             165   0.84   0.30         488
  11. Edibles                                 142   0.83   0.29         470
  12. Objects                                  24   0.85   0.25         458
  Total                                      1091   0.90   0.22

Dependent variables
  63. Self-stimulation                         94   0.92   0.20         589
  52. Quantity of academic behaviors          105   0.90   0.24         584
  53. Social behaviors                        267   0.92   0.17         556
  62. Self-injury                              71   0.90   0.23         556
  61. Problem behaviors                       352   0.90   0.22         543
  51. Quality of academic behaviors           202   0.85   0.26         494

Moderators
  Settings
    Home                                       83   0.94   0.14         587
    Other place                                77   0.91   0.18         544
    Institution                               328   0.88   0.26         531
    School                                    583   0.90   0.21         531
  Interventionists
    Teacher                                   386   0.92   0.16         551
    Parent                                     47   0.91   0.16         547
    Assistant                                 358   0.89   0.24         539
    Specialist                                260   0.87   0.26         514
    Tutor                                      20   0.76   0.35         449
  Category of participants
    Deaf                                        6   1.00   0.00         711
    Mental patient                             34   0.96   0.17         653
    Normal intelligence                       257   0.92   0.17         548
    With autism                               402   0.88   0.25         544
    Mental retardation                        251   0.90   0.21         537
    Learning disability                        14   0.84   0.28         517
    Emotional disorder                         42   0.89   0.24         576
    ADHD                                       78   0.87   0.23         508
  Age of participant
    7-<13                                     432   0.92   0.19         548
    <7                                        321   0.88   0.24         521
    ≥18                                       121   0.87   0.28         518
    13-<16                                    152   0.88   0.22         502
    16-<18                                     30   0.90   0.23         500
  Order of pair of phases
    Second pair                               286   0.91   0.20         577
    First pair                                805   0.89   0.23         535
  Assessment of reinforcer
    Preference test                           329   0.90   0.24         571
    Decided by author                         411   0.90   0.21         546
    Functional analysis                       179   0.91   0.18         541
    Using money                                57   0.92   0.16         525
    Parent's suggestion                       115   0.84   0.28         493
  Gender of participant
    Female                                    308   0.89   0.22         469
    Male                                      666   0.89   0.23         496
  Kind of design
    Reversal                                  677   0.91   0.20         564
    Multiple baseline                         414   0.87   0.25         516


Note. The independent variable “punishment” refers to the punishments other than removal of token. 


The Mean Effect Size of Independent Variables 

The grand mean effect size of the 1091 effect sizes was .90 with a standard deviation of .22. In testing
the homogeneity of variances of residuals of the 12 categories of independent variables (interventions or
treatments), the Levene statistic was F(11, 1079) = 10.97, p < .01, indicating that the assumption of the
homogeneity of variances of residuals was also violated. The nonparametric statistic
demonstrated that the difference between the mean ranks of effect sizes of the 12 categories of independent
variables was significant, Kruskal-Wallis analysis of variance by ranks, χ²(11, N = 1091) = 57.79, p < .01.
The independent variables whose mean effect sizes were higher than .90 were “activities”, “token
plus punishment”, “negative reinforcer”, “token”, “DRA with token”, and “positive reinforcer plus
punishment”. The most effective reinforcer was “activities”, while the relatively less (moderately) effective
interventions were those that involved using edibles, tangible objects, and token plus removal of token.
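
For readers unfamiliar with the rank-based test used here, the sketch below computes the Kruskal-Wallis H statistic with the usual correction for ties, in plain Python. It is a minimal illustration with hypothetical function names and made-up scores, not a reconstruction of the SPSS analysis; H is referred to a chi-square distribution with (number of groups - 1) degrees of freedom, and the p-value lookup is omitted.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of scores."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    return h / correction

# Hypothetical PEM scores for three intervention categories
groups = [[1.0, 0.9, 1.0, 0.8], [0.7, 0.9, 0.6, 0.8], [0.5, 0.6, 0.4, 0.7]]
print(kruskal_wallis_h(groups))
```

Because many PEM scores equal 1.00, ties are frequent in data of this kind, which is why the tie correction is included in the sketch.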

There were 148 effect sizes resulting from the intervention “activities”. Further analysis showed no
significant difference between the mean ranks of effect size of the six dependent variables, Kruskal-Wallis
analysis of variance by ranks, χ²(5, N = 148) = .77, p = .98. This result indicates that an intervention with
“activities” can be as effective when used for the establishment of desirable behaviors as for eliminating
undesirable ones. Multiple post hoc comparisons of independent variables by means of the Mann-Whitney
U test resulted in (13, 30, 21, 40) > (14, 41, 11, 12); 13 > (42, 22, 23); (21, 40) > 22; 43 > 11;
and 21 > 23. The numbers represent the coding number of each subcategory of independent variable in
Table 1. Parentheses indicate that the mean ranks of the enclosed subcategories are all
significantly larger than those of the subcategories after the “>” sign. Within the parentheses, the subcategories
are arranged in order of mean rank, but their differences are not significant. For instance,
(21, 40) > 22 means that the mean ranks of the effect sizes of “negative reinforcer” and “token” were
significantly higher than that of “withdrawal of positive reinforcer”, and that the mean rank of effect sizes
of “negative reinforcer” was higher than that of “token”, but the difference was not significant. Table 1
shows that among the positive reinforcers, “activities” was the most effective while “edibles” and
“objects” were relatively less effective.

Normally, the rank order of the mean effect sizes corresponds with that of the mean ranks; however,
inconsistency can occasionally occur due to the different number of effect sizes as well as the variability
and outliers of effect sizes in the treatment phase of the categories to be compared. For example, the
category “token plus punishment” showed a higher mean effect size (.96) than
“activities” (.95) but a lower mean rank (525 compared with 624 for “activities”). This may be
due to the different proportions of extreme effect sizes in the two categories. Among the 18 effect sizes of
“token plus punishment” there were 10 (55.56%) with a PEM score of 1.00, while among the 148
effect sizes of “activities” there were 125 (84.46%) with a PEM score of 1.00.
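
The way a category can have the higher mean but the lower mean rank is easy to reproduce with a toy example. The sketch below uses made-up scores and hypothetical function names: group A has many scores of 1.00 plus one very low score, while group B has uniformly good but not perfect scores.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mean_ranks(groups):
    """Mean of the pooled ranks within each group."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    out, start = [], 0
    for g in groups:
        out.append(sum(ranks[start:start + len(g)]) / len(g))
        start += len(g)
    return out

# Hypothetical PEM scores
group_a = [1.0, 1.0, 1.0, 1.0, 1.0, 0.2]   # mostly perfect, one low outlier
group_b = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9]   # uniformly high, never perfect

mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)
rank_a, rank_b = mean_ranks([group_a, group_b])
print(mean_a, mean_b)    # B has the higher mean
print(rank_a, rank_b)    # A has the higher mean rank

# Shares of PEM = 1.00 quoted in the text
print(round(100 * 10 / 18, 2), round(100 * 125 / 148, 2))
```

In this toy case the single low score pulls down the mean of group A but costs it only one rank position, so the ordering by mean and the ordering by mean rank disagree, as they do for “token plus punishment” and “activities”.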

The mean effect size of the token-reinforcement program (.89) with a standard deviation of .24 was
slightly lower than that of immediate consumption of reinforcers (M = .90, SD = .21), but the difference
was not significant. A Mann-Whitney U test showed Z = -.38, p = .70. This demonstrates that
deferring the delivery of a reinforcer reduces the power of the reinforcement only slightly, while
it can also prevent the participant from becoming satiated with primary reinforcers.
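
A Z value of the kind reported for these two-group comparisons can be obtained from the normal approximation to the Mann-Whitney U statistic. The sketch below is a simplified illustration with hypothetical function names and made-up data; it omits the tie correction and continuity correction that statistical packages may apply, so it will not reproduce package output exactly when many scores are tied.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_z(x, y):
    """Return (U for x, z) using the normal approximation without tie correction."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one score")
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / sd_u

# Hypothetical PEM scores for two delivery conditions
delayed = [0.80, 0.85, 0.95, 0.70]
immediate = [0.90, 0.75, 0.88, 0.92]
print(mann_whitney_z(delayed, immediate))
```

A negative z here simply means that the first group tends to have the lower ranks; its magnitude is compared with the standard normal distribution to obtain the two-sided p value.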

Comparing the package of “token plus punishment” with that of “token plus removal of token” showed
that the former strategy was more effective than the latter. A Mann-Whitney U test yielded
Z = -2.11, p = .04, indicating that the difference was significant. The reason may be that when
a token is removed as a punishment, the token becomes associated with an aversive experience and hence
its power of reinforcement is diminished.

In the analysis of the assessment of preference for reinforcers, the difference between the mean ranks
of effect sizes was on the edge of significance. A Kruskal-Wallis analysis of variance by ranks showed
that χ²(4, N = 1091) = 8.42, p = .08. However, the post hoc comparison displayed a significant difference
between “preference test” and “parent’s suggestion” in favor of the former. The Mann-Whitney U test
showed Z = -2.72, p < .01.

The Mean Effect Size of Dependent Variables

A Kruskal-Wallis analysis of variance by ranks revealed a significant difference
between the mean effect sizes of the six dependent variables, χ²(5, N = 1091) = 13.88, p = .02. A Mann-Whitney
U test showed that (63, 52, 53, 61) > 51, meaning that there was no significant difference
among the effectiveness of interventions on the four dependent variables “self-stimulation”, “completed
works”, “desirable social behaviors”, and “problem behaviors”, and that the effectiveness on each of these
four dependent variables was significantly higher than that on “works demanding accuracy”. This result
implies that improving the accuracy of work is more difficult than changing the other kinds of dependent
variables because the accuracy of work is a matter of “can” rather than only “will”.

The Effect of Other Moderators 

By employing the Kruskal-Wallis analysis of variance by ranks to test whether the effectiveness of
interventions on the dependent variables was influenced by moderators, no significant difference was
found for the moderators of setting, interventionist, category of participants, age and gender of
participants, implying that the effectiveness of interventions on dependent variables can be generalized to
different settings, interventionists, categories of participants, and ages and genders of participants.

To test whether a longer treatment phase was associated with higher effectiveness, a Pearson 
correlation coefficient was calculated. No significant correlation between the length of the 
treatment phase and the effect size was found, r(1089) = -.01, p = .78. 
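
The correlation reported above can be illustrated with a minimal sketch of the Pearson coefficient. The function name and the sample values are hypothetical and are not data from the study; the point of the sketch is only to show the computation and that the degrees of freedom equal the number of paired observations minus two, so that 1091 effect sizes give r(1089).

```python
import math


def pearson_r(x, y):
    """Return (r, df) for paired observations; df = n - 2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), n - 2


# Hypothetical treatment lengths (sessions) and PEM scores.
lengths = [5, 8, 10, 12, 20]
scores = [0.90, 1.00, 0.85, 0.95, 0.92]
r, df = pearson_r(lengths, scores)
```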

The only two moderators that showed significance were the order of the pair of phases and the kind of 
design. The second pair of phases in the reversal designs showed a significantly larger mean effect size 
than the first one. The Mann-Whitney U test showed that Z = -2.36, p = .02. The studies employing 
reversal designs demonstrated a higher mean effect size than those using multiple-baseline designs. The 
Mann-Whitney U test showed that Z = -2.98, p = .01. 
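
For readers unfamiliar with the Z values reported for the Mann-Whitney U test, the following minimal sketch shows one common way to obtain them, using the normal approximation. The function name and data are hypothetical, and the sketch assumes no tie correction and no continuity correction, which may differ from the software used in the study.

```python
import math


def mann_whitney_z(a, b):
    """Return (U, Z) for group a versus group b (normal approximation)."""
    # U counts the pairs in which a value from a exceeds a value from b;
    # tied pairs count one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / sd_u


# Hypothetical PEM scores for first and second baseline-treatment pairs.
first_pair = [0.70, 0.80, 0.85]
second_pair = [0.90, 0.95, 1.00]
u, z = mann_whitney_z(first_pair, second_pair)
```

The sign of Z depends only on which group is entered first; a negative value here means that the first group tends to have the lower ranks.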

Discussion 

The effectiveness in terms of mean effect size of the 12 reinforcing strategies investigated in the 
present study ranges from .83 to .96, i.e., from a moderate to large effect size as compared with the 
criterion suggested by Scruggs et al. (1986). 

The finding that the effectiveness of “activities” and “token” was significantly higher than that of 
“edible” and “object” may be explained by the fact that the American participants in the relevant studies 
were not personally deprived of food and object reinforcers in ordinary daily life. It may therefore not 
necessarily be possible to generalize the finding to individuals in less wealthy countries. 

The finding that there was no significant correlation between the length of treatment phase and 
effectiveness is consistent with the results of the study conducted by Vegas, Jenson, and Kircher (2007). 
What is important for the magnitude of effectiveness is not the length of the treatment phase but the 
power of reinforcement of a reinforcer. The result of the present study that the mean rank of effect size of 
the second pair of baseline-treatment phases was higher than that of the first pair can serve to reduce 
concern about the orthogonal slope change mentioned by Scruggs, Mastropieri, & Casto (1987). They 
anticipated that the orthogonal slope change would threaten the effect size of the second baseline-treatment 
pair, but the result of the present study demonstrated that this is not the case. The larger effectiveness of 
treatment in the second pair might be attributed to the accumulated effect of the treatment. 

The result that not all positive reinforcers have the same effectiveness also supports the importance of 
the element of positive reinforcement that the reinforcer must satisfy the need (or reduce the deprivation) 
of the individual. This conclusion was also partially supported by the result of the present study that an 
intervention showed a higher effectiveness if the reinforcer was determined through a preference test 
rather than simply by asking for the suggestion of significant others such as the parent. The finding that 
the reinforcer “activities” had the highest mean effect size (.95) indirectly confirmed the Premack 
principle, which states that a high-frequency activity can be used to reinforce a low-frequency activity. 
By logical inference, the Premack principle can alternatively be expressed as using a student's strongly 
intrinsically motivated activity as an extrinsic reinforcer to motivate him or her to learn weakly 
intrinsically motivated but important academic and social behaviors. This finding has practical 
implications. As suggested by Kern, Bambara, and Fogt (2002), academic activities can be associated with 
choice-making opportunities, such as choice of activity, choice of teaching or learning materials, and 
choice of task sequence, to modify class-wide curricula. Their research demonstrated that the curricular 
modification resulted in increased levels of engagement and decreased levels of destructive behavior and 
that it can be compatible with school policy. 

The present study does not address the controversy between intrinsic and extrinsic motivation. 
However, all the studies located for the present research were conducted to improve task-completion 
or quality-related performance and were based on the assumption that the participant had 
weak intrinsic motivation. Thus the results of the present study imply that extrinsic reinforcement may 
motivate participants with a weak intrinsic motivation to improve quantitatively and/or qualitatively in 
their academic or social behaviors. 

The feasibility of the PEM approach suggests that PEM scores can be used, as a complement to visual 
inspection of the trend and variability of the data points in the treatment phase, to describe 
quantitatively the results of an article employing a single-case experimental design and to judge the 
effectiveness in accordance with the criterion set by Scruggs et al. (1986). 
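
To make the quantitative description concrete, the following minimal sketch computes a PEM score as defined in the abstract, that is, the proportion of treatment-phase data points exceeding the median of the baseline phase. The function name, the parameter names, and the sample data are hypothetical. Two choices in the sketch are assumptions rather than statements of the authors' procedure: for behaviors targeted for reduction, points below the baseline median are counted, and points equal to the median are not counted as exceeding it.

```python
from statistics import median


def pem_score(baseline, treatment, increase_expected=True):
    """Proportion of treatment points beyond the baseline median.

    increase_expected=True counts points above the median (behavior
    targeted for increase); False counts points below it (assumption).
    """
    base_median = median(baseline)
    if increase_expected:
        beyond = sum(1 for value in treatment if value > base_median)
    else:
        beyond = sum(1 for value in treatment if value < base_median)
    return beyond / len(treatment)


# Hypothetical session data: completed tasks per session.
baseline_phase = [2, 3, 4, 3, 2]      # median = 3
treatment_phase = [5, 6, 3, 7, 8]     # four of five points exceed 3
score = pem_score(baseline_phase, treatment_phase)
```

In this hypothetical example the score is 4/5 = .80, because one treatment point equals the baseline median and is therefore not counted.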

The number of studies located for use in the present investigation is still too small to further analyze 
whether different schedule, duration, intensity, or amount of reinforcement produces different effects, and 
researchers should pay attention to these issues in future studies. 